Jess Goddard, Rich Pauloo, Ryan Shepherd, Noor Brody

Last updated 2022-07-01 14:57:02


1 Introduction

In this report, we present a nationwide water service spatial boundary layer for community water systems and explain the tiered approach taken towards this end. The water service boundary layer produced in this report should not be interpreted as a final product, but rather, one that may be improved with the assimilation of additional water service boundaries from states and improvements to the methodology.

This report builds on a previous technical memorandum (eda_february.html) that should be read as prerequisite material to understand the broader context, background, and exploratory data analysis (EDA) that informs the approach taken herein. The resulting national water service boundary layer is the product of a “Tiered Explicit, Match, and Model” (henceforth, TEMM) approach. The TEMM is composed of three hierarchical tiers, arranged by data and model fidelity.

  1. Whenever present, we use explicit water service boundaries (labeled data). These are spatial polygon data, typically provided at the state-level.
  2. In the absence of explicit water service boundary data, we use a matching algorithm to match PWSIDs to TIGER Place polygons by spatial intersection and by name. When a water system and TIGER Place match one-to-one, we label this Tier 2a. In earlier versions of this report, cases where multiple water systems matched the same TIGER Place were labeled Tier 2b; Tier 2b has since been removed, and we retain the Tier 2a label here to make that removal explicit (future updates will refer to these simply as Tier 2). We validate this heuristic approach on explicit water service boundaries1, and find favorable results consistent with findings by Patterson et al. (2022), who showed that Census TIGER Places generally agree with municipal boundaries.
  3. Finally, in the absence of an explicit water service boundary (Tier 1) or a TIGER place polygon match (Tier 2a), a statistical model trained on explicit water service boundary data (Tier 1) is used to estimate a reasonable radius at provided PWSID centroids, and model a spherical water system boundary (Tier 3). Centroid accuracy issues that remain intractable at this point in time are discussed in the Limitations and Recommendations sections.

Below is a conceptual diagram of the three tiers in the TEMM approach. Tier 1 explicit boundaries always supersede Tier 2 matched proxy boundaries, which in turn always supersede Tier 3 modeled boundaries. Thus, the water service boundary layer described in this report is a combination of all three tiers, and the tier assigned to each water system depends on data availability and on whether or not it matches a TIGER place.

Conceptual diagram of water service boundary tiers

Figure 1.1: Conceptual diagram of water service boundary tiers

In the sections that follow, we summarize our approach for each of the three tiers. As we move from Tier 1 to Tier 3, uncertainty increases; hence, we describe each Tier in increasing detail and provide measures of validation and uncertainty for Tier 2 and Tier 3 estimates.

Finally, we encourage the reader to consider the TEMM water service boundary layer not as a final product, but rather, one that may be improved with the assimilation of additional Tier 1 explicit data, improvements to the Tier 2 matching algorithm, and refinement of the Tier 3 model. Ultimately, this entire workflow may be superseded by nationwide Tier 1 data, which would reduce the problem we address in the scope of work to one of simple Tier 1 data assimilation and cleaning.
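The tier precedence described above amounts to a simple fallback rule. A minimal sketch in base R (with hypothetical availability flags, not the actual pipeline) is:

```r
# Minimal sketch of TEMM tier precedence (hypothetical inputs, not the
# actual pipeline): explicit data beats a TIGER match, which beats a model.
assign_tier <- function(has_explicit_boundary, has_tiger_match) {
  if (has_explicit_boundary) {
    "Tier 1"   # explicit state-provided boundary
  } else if (has_tiger_match) {
    "Tier 2a"  # one-to-one TIGER Place proxy boundary
  } else {
    "Tier 3"   # modeled circular boundary at the system centroid
  }
}

assign_tier(TRUE, TRUE)    # "Tier 1"
assign_tier(FALSE, TRUE)   # "Tier 2a"
assign_tier(FALSE, FALSE)  # "Tier 3"
```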


1.1 Data Sources

Data sources for the TEMM layer are shown below.

Data Source | Description | Link
EPA Safe Drinking Water Information System | Public water systems “master list” with key attribute data and some tabular geographic data like cities served | SDWIS MODEL
EPA Enforcement and Compliance History Online | Public water system facilities archive, drawing lat/long data for facility centroids from FRS | ECHO Exporter
EPA Facility Registry Service | FRS regularly updates facilities data with lat/long information, which pipes into ECHO | FRS Geospatial
US Census Bureau TIGER/Line (also called “TIGER Places”) | US Census data of places (cities and towns) used to identify potential service area boundaries | TIGER/Line Shapefiles accessed using R package tigris
Unregulated Contaminant Monitoring Rule | UCMR 3 and UCMR 4 provide data on pwsid and zip codes served, which can provide higher-integrity centroids where needed | UCMR Occurrence data
Homeland Infrastructure Foundation-Level Data | Mobile home park centroids | MHP centroids
Labeled Water System Boundaries | URLs on state pages for various water service boundary sources | State URL tracking doc


2 Report outline

library(tidyverse)
library(sf)
library(fs)
library(rcartocolor)
library(geofacet)
library(mapview)

# don't allow fgb streaming: delivers self-contained html
mapviewOptions(fgb = FALSE)

# load environmental variable for staging path 
staging_path <- Sys.getenv("WSB_STAGING_PATH")
output_path <- Sys.getenv("WSB_OUTPUT_PATH")

# read TEMM spatial output for key result summary stats below
temm <- dir_ls(path(output_path, "temm_layer/"), glob = "*geojson") %>% 
  # get the most up to date temm spatial geojson layer if there are many
  sort() %>% rev() %>% .[1] %>% 
  st_read(quiet = TRUE) %>% 
  # drop this column because it's read as a list and causes mapview trouble
  select(-service_area_type_code)

# Tier 1 labeled data
wsb_labeled_clean <- path(staging_path, "wsb_labeled_clean.geojson") %>% 
  st_read(quiet = TRUE)

# read the clean matched output and perform minor transforms for plots.
# this has the same number of rows as `temm`, but contains Tier 1 "radius"
matched_input <- read_csv(path(staging_path, "model_input_clean.csv")) %>%
  mutate(
    # transform radius from m to km
    radius = radius/1000) 

# combine to get complete list
d <- matched_input %>%
  left_join(temm %>%
              select(pwsid, tier), by = "pwsid") %>%
  select(-geometry)

# total water system count where pop available
nrow_d <- nrow(d)

# population served by all water systems
# Note there are systems with missing population, ignore
pop_total <- sum(d$population_served_count, na.rm = TRUE)

# calculate count and proportion of people served by each tier
pop <- d %>% 
  filter(!is.na(population_served_count)) %>%
  group_by(tier) %>% 
  summarize(
    # format_bign() is a big-number formatting helper defined in the report setup
    count = format_bign(sum(population_served_count)),
    prop  = round((sum(population_served_count)/pop_total)*100, 2)
  )

# read in sdwis original set of water systems for comparisons
acws <- read_csv(path(staging_path, "sdwis_water_system.csv")) %>%
  filter(pws_activity_code == "A" &
           pws_type_code == "CWS")

# number of active, community water systems in SDWIS
n_acws <- acws %>% nrow()

# population served by active, community water systems in SDWIS
pop_acws <- sum(acws$population_served_count)

# get full list of temm systems (incl those without population), "a" for all
a <- temm

# remove geometry
st_geometry(a) <- NULL

# calculate count and proportion of water systems by each tier 
sys <- a %>% 
  group_by(tier) %>% 
  summarize(
    count = n(),
    prop  = round((count/n_acws)*100, 2)
  )

The following outline reflects a summary of key findings, followed by a description of each Tier in the TEMM approach. Notably, the TEMM may be refined and re-run as new data sources are ingested or improvements are made to matching and modeling algorithms.

Key Results

  • A nationwide water service boundary layer is provided for 44,786 community water systems2 covering a total population of 307,108,553.
  • Population coverage rates per Tier, for systems with population reported:
    • Tier 1: 44.42% population covered (136,404,105 people)
    • Tier 2a: 40.91% population covered (125,652,248 people)
    • Tier 3: 14.67% population covered (45,052,200 people)
  • CWS coverage rates per Tier:
    • No Tier/Geometry: 7.01% of systems (3,464 systems)
    • Tier 1: 34.22% of systems (16,915 systems)
    • Tier 2a: 23.32% of systems (11,526 systems)
    • Tier 3: 35.45% of systems (17,519 systems)
  • The distribution of population in each of these Tiers varies by state (see state-level coverage map).
  • Favorable uncertainty metrics for Tier 2 and 3 estimates suggest that the TEMM approach provides a reasonable preliminary water service boundary layer.

Tier 1: Explicit boundaries

  • Data discovered for 14 states (AZ, CA, CT, IL, KS, MO, NC, NJ, NM, OK, PA, TX, UT, WA).
  • These data are considered “high quality” and used to model Tier 3 boundaries.

Tier 2: Matched TIGER proxy boundaries

  • 34.85% of water systems (17,225 / 49,424) match a TIGER Place, either by spatial intersection between water system facility centroids and TIGER boundaries or by water system name match to TIGER Place names. Of these, 11,526 systems are assigned the TIGER boundary; the remaining matched systems are modeled as Tier 3 to avoid assigning the same Census Place to more than one system.
  • Spatial and name matching is constrained within the reported state to prevent out-of-state mismatches.
  • When a system has multiple spatial or name matches, minimum spatial distance between the reported PWSID centroid and the matched TIGER Place centroid is used to break ties and assign a single proxy boundary.
  • When multiple water systems match to the same TIGER boundary, a “best match” is determined through a scoring system found in the matching folder.
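The centroid-distance tie-break in the bullet above can be sketched as follows (hypothetical candidate distances; the actual scoring system lives in the matching folder):

```r
# Sketch of the tie-break rule (hypothetical candidates): among multiple
# TIGER Place matches for one system, keep the nearest-centroid candidate.
candidates <- data.frame(
  geoid   = c("0664000", "0653000", "0649270"),
  dist_km = c(1.2, 8.7, 15.3)  # PWSID centroid to TIGER Place centroid
)
best <- candidates[which.min(candidates$dist_km), ]
best$geoid  # "0664000"
```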

Tier 3: Modeled boundaries

  • Models are based on centroids provided by ECHO and FRS. Where geometry accuracy is low (i.e., a county or state centroid), centroids of zip codes served from UCMR or mobile home park centroids are used, if available.

  • Random Forest, xgboost, and multiple linear regression models were fit to Tier 1 labeled boundaries to develop a model that estimates water system radii and spatial boundaries at all systems nationwide.

  • Linear regression models had the lowest error and best fit.

  • Circular water service boundary estimates were then calculated for all water systems; the median (50th percentile), 5th percentile, and 95th percentile estimates reflect medium, low, and high water service boundary estimates.

  • The multiple linear regression fit has an impressive \(R^2 = 0.64\) and is mostly explained by service connection count, which intuitively makes sense: there is roughly linear scaling between service connection count and the spatial extent of a water service area. EDA, Random Forest, and xgboost models confirm the importance of service connection count as an explanatory variable.
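As a sketch of the Tier 3 geometry step, a modeled radius becomes a circular boundary by buffering the system centroid in a metric CRS (the centroid and radius below are hypothetical, not model output):

```r
library(sf)

# Sketch: convert a modeled radius at a system centroid into a circular
# Tier 3 boundary. The centroid and radius below are hypothetical.
centroid <- st_sfc(st_point(c(-121.49, 38.58)), crs = 4326)  # lon/lat
radius_m <- 3200  # e.g., a median (50th percentile) modeled radius, meters

# project to a metric CRS before buffering, then buffer by the radius
boundary <- st_buffer(st_transform(centroid, 3310), dist = radius_m)
```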

Combine Tiers

  • After boundaries from all three Tiers are assimilated, matched, or modeled, they are combined into a final spatial data layer. As depicted in Figure 1 above, Tier 1 boundaries always supersede Tier 2 boundaries, which always supersede Tier 3 boundaries.

Limitations

  • Data and model limitations are discussed, and accompanied by recommendations in the following section.

Recommendations

  • Directions for future work to address the limitations of the TEMM approach are briefly discussed, in particular:
    • Develop an approach to refine Tier 2b polygons to disaggregate the boundary to the underlying water systems – accomplished in this latest iteration.

    • Centroid accuracy and the “pancake problem” may be improved by “re-centering” low-accuracy centroids with city/town spatial databases or acquiring new centroids from states – next steps.

    • Some water systems have a primacy agency code that conflicts with the spatial location (i.e., coordinates) of the water system. For example, a water system with a primacy agency code of state A may have a centroid located in state B. This could be resolved with additional data cleaning.
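The primacy-code conflict check described above could be sketched as follows (hypothetical columns and rows, assuming a point-in-polygon state lookup has already been done):

```r
# Sketch (hypothetical columns): flag systems whose primacy agency code
# disagrees with the state containing the system centroid.
systems <- data.frame(
  pwsid          = c("CA0000001", "CA0000002"),
  primacy_state  = c("CA", "CA"),
  centroid_state = c("CA", "NV")  # from a point-in-polygon state lookup
)
conflicts <- systems[systems$primacy_state != systems$centroid_state, ]
conflicts$pwsid  # "CA0000002"
```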

Contributions

  • The significance and broader impact of the TEMM open source, nationwide water service boundary layer is briefly discussed.
  • The reader is oriented towards contributor guides and a vision for the continual development of the project is outlined.


3 Key Results

The key result of this study is a nationwide water service boundary layer. Here we show the proportion of population served by each TEMM Tier at nationwide and statewide scales.


3.1 Tier coverage - nationwide

In total, the TEMM data layer represents tap water delivery to 307.11 million people served by 44,786 water systems3. This amounts to 97.14% of the population reportedly served by active community water systems in SDWIS, and 90.62% of active community water systems in SDWIS. Note that some systems report a population of 0 or 1; these are typically wholesalers, so the totals for population served carry some inaccuracy.

About 136 million people are covered by Tier 1 spatial data – impressive given that only 14 states provided explicit boundary data. However, this relatively high coverage rate is unsurprising because these states (AZ, CA, CT, IL, KS, MO, NC, NJ, NM, OK, PA, TX, UT, WA) include high-population states like CA, TX, and PA.

Together, around 262 million people (85.33% of the population) are covered by either a Tier 1 or a Tier 2a spatial boundary. The remaining approximately 45 million people (14.67%) represented in the layer are covered by a less accurate, Tier 3 boundary. These results indicate relatively high confidence in the spatial accuracy of the resulting TEMM water boundary layer for a majority of the population served by community water systems.

pop %>% 
  kable(col.names = c("Tier", "Population count", 
                      "Population proportion (%)"), 
        align = rep("l", 3),
        caption = "Population coverage by tier (for systems with population reported)") %>% 
  kableExtra::kable_styling(full_width = FALSE)
Table 3.1: Population coverage by tier (for systems with population reported)
Tier Population count Population proportion (%)
Tier 1 136,404,105 44.42
Tier 2a 125,652,248 40.91
Tier 3 45,052,200 14.67

Together, around 28,441 community water systems (57.54% of systems in the layer) are covered by either a Tier 1 or a Tier 2a spatial boundary. Approximately 17,519 water systems (35.45%) are covered by a Tier 3 boundary. The remaining 3,464 systems (7.01%) have no geometry associated with them and are therefore not represented in the final TEMM layer.

sys %>% 
  kable(col.names = c("Tier", "CWS\ncount", 
                      "CWS\nproportion (%)"), 
        align = rep("l", 3),
        caption = "Community water system (CWS) coverage by tier") %>% 
  kableExtra::kable_styling(full_width = FALSE)
Table 3.2: Community water system (CWS) coverage by tier
Tier CWS count CWS proportion (%)
none 3464 7.01
Tier 1 16915 34.22
Tier 2a 11526 23.32
Tier 3 17519 35.45


3.2 Tier coverage - statewide

Next, we show the proportion of population covered in the final spatial layer per TEMM Tier on a state-by-state basis. Notably:

  • When Tier 1 data is present (grey bars below), it tends to cover a majority of the population.
  • Tier 2a boundaries (teal bars below) cover more population than Tier 3 across nearly all states.
  • Tier 3 coverage (orange bars below) is not uniform across states. This implies that the Tier 2 matching algorithm may perform better in some states, and may depend on factors that vary across states, like centroid accuracy and water system type (e.g., municipal).
# state pop from full sdwis list
state_pop <- acws %>%
  group_by(primacy_agency_code) %>%
  summarize(state_pop = sum(population_served_count))

# dataframe for barplot geofacet: population prop served by each tier
# using all systems
ag <- a %>% 
  group_by(primacy_agency_code, tier) %>% 
  left_join(state_pop, by = "primacy_agency_code") %>%
  # calculate count/proportion of population served per state and tier
  summarise(
    count = sum(population_served_count, na.rm = TRUE),
    prop  = count/state_pop
  ) %>% 
  ungroup() %>% 
  distinct() %>% 
  # add missing NA values per state code and tier group
  complete(primacy_agency_code, tier,
           fill = list(count = 0, prop = 0)) 

# sanity check the grouped summary above: this should return all 1
# group_by(ag, primacy_agency_code) %>% 
#  summarise(s = sum(prop, na.rm = TRUE)) %>% 
#  pull(s) 

# geofacet of TEMM tier coverage per state in terms of population proportion
ag %>% 
  ggplot(aes(fct_rev(tier), prop, fill = tier)) + 
  geom_col() + 
  coord_flip() +
  scale_fill_carto_d(direction = -1) +
  scale_y_continuous(breaks = c(0, 1, 0.5), 
                     labels = c(0, 1, 0.5)) +
  geofacet::facet_geo(~primacy_agency_code) +
  labs(x = "", y = "Proportion of Population Covered") +
  guides(fill = "none") +
  theme_minimal(base_size = 6) +
  theme(panel.grid.minor.x = element_blank())
TEMM Tier distribution by state

Figure 3.1: TEMM Tier distribution by state

Native American water systems are designated in their water system id (pwsid) by their respective EPA region. For a full map of EPA regions, see here. Below we show that most of these systems lack Tier 1 data and have a generally higher proportion of Tier 3 data compared to systems under state primacy codes.

ag %>% 
  # filter to Native American primacy codes
  filter(! primacy_agency_code %in% c(state.abb, "DC")) %>% 
  ggplot(aes(fct_rev(tier), prop, fill = tier)) + 
  geom_col() + 
  coord_flip() +
  scale_fill_carto_d(direction = -1) +
  scale_y_continuous(breaks = c(0, 1, 0.5), 
                     labels = c(0, 1, 0.5)) +
  facet_wrap(~primacy_agency_code) +
  labs(x = "", y = "Proportion of Population Covered") +
  guides(fill = "none") +
  theme_minimal(base_size = 6) +
  theme(panel.grid.minor.x = element_blank())
TEMM Tier distribution of systems in Native American territories-by EPA agency

Figure 3.2: TEMM Tier distribution of systems in Native American territories-by EPA agency

# Data on number of native american territory systems overall vs. final set
nam_rep <- d %>% filter(! primacy_agency_code %in% c(state.abb, "DC"),
                        ! primacy_agency_code %in% c("PR", "VI", "NN", "AS", "MP", "GU")) %>% nrow()
nam <- acws %>% filter(! primacy_agency_code %in% c(state.abb, "DC"), 
                ! primacy_agency_code %in% c("PR", "VI", "NN", "AS", "MP", "GU")) %>% nrow()


In total, of the 569 systems in Native American territories nationwide, the final database (after data loss due to missing centroids) represents 337 (59.23%) systems. A data table of the above plots is provided below.

Table: TEMM tier distribution by state


3.3 Tier coverage - system size

Next, for each state, we quantify the proportion of systems in each Tier by water size based on population categories according to EPA guidelines:

tribble(
~`EPA classification`, ~`Population Size`,
 "Very Small",            "500 or less",
 "Small",                 "501 - 3,300",
 "Medium",                "3,301 - 10,000",
 "Large",                 "10,001 - 100,000",
 "Very Large",            ">100,000"
) %>% 
  kable(align = rep("l", 2),
        caption = "EPA system size categorization") %>% 
  kableExtra::kable_styling(full_width = FALSE)
Table 3.4: EPA system size categorization
EPA classification Population Size
Very Small 500 or less
Small 501 - 3,300
Medium 3,301 - 10,000
Large 10,001 - 100,000
Very Large >100,000

In states that lack Tier 1 boundaries, “Very Small” and “Small” systems tend to have higher proportions of Tier 3 boundaries (grey bars) compared to larger systems. This result is unsurprising because Tier 2 spatial matches hinge on connecting water systems to Census “Places” which tend to be more populous areas. We therefore expect that smaller systems lack a Tier 2 match, and are hence covered by Tier 3 boundaries. When Tier 1 data is present, it seems to cover all system sizes (i.e., CA, NJ, TX), with a slight bias towards covering larger systems (e.g., OK, WA, CT).

# water system size levels in table above
size_lvls <- c("Missing Pop.", "Very Small", "Small", "Medium", "Large", "Very Large")

# dataframe for barplot geofacet: prop of each each tier per size class
ag2 <- a %>% 
  mutate(
    size_class = case_when(
      # in 99% of cases, pop served = 0 or 1 indicates wholesaler, but reporting as "missing pop"
      population_served_count == 0 | population_served_count == 1 ~ "Missing Pop.",
      population_served_count <= 500 ~ "Very Small",
      population_served_count > 500 & population_served_count <= 3300 ~ "Small",
      population_served_count > 3300 & population_served_count <= 10000 ~ "Medium",
      population_served_count > 10000 & population_served_count <= 100000 ~ "Large",
      population_served_count > 100000 ~ "Very Large"),
    size_class = factor(size_class, levels = size_lvls)
  ) %>% 
  group_by(primacy_agency_code, size_class) %>%
  # total count of tiers per state and size class
  mutate(size_class_count = n()) %>%
  ungroup() %>% 
  group_by(primacy_agency_code, size_class, tier) %>% 
  # calculate count/proportion of tiers per state and size class
  summarise(
    count = n(),
    prop  = count/size_class_count
  ) %>% 
  ungroup() %>% 
  distinct() %>% 
  # add missing NA values per state code and tier group
  complete(primacy_agency_code, size_class, tier, 
           fill = list(count = 0, prop = 0)) 

# geofacet of TEMM tier coverage per state in terms of population proportion
ag2 %>% 
  ggplot(aes(size_class, prop, fill = tier)) + 
  geom_col(position = "stack") + 
  coord_flip() +
  scale_fill_carto_d(direction = -1) +
  scale_y_continuous(breaks = c(0, 1, 0.5), 
                     labels = c(0, 1, 0.5)) +
  geofacet::facet_geo(~primacy_agency_code) +
  labs(x = "", 
       y = "Proportion of Systems Covered by Each Tier, Split by System Size",
       subtitle = "Missing Pop indicates recorded population of 0 or 1, usually wholesaler",
       fill = "Tier") +
  theme_minimal(base_size = 6) +
  theme(panel.grid.minor.x = element_blank(), legend.position = c(0.95, 0.15))
TEMM Tier distribution by system size

Figure 3.3: TEMM Tier distribution by system size

Repeating the above analysis for primacy codes indicating Native American territories, we see that Native American systems are generally small or very small according to EPA size classification.

ag2 %>% 
  filter(! primacy_agency_code %in% c(state.abb, "DC")) %>% 
  ggplot(aes(size_class, prop, fill = tier)) + 
  geom_col(position = "stack") + 
  coord_flip() +
  scale_fill_carto_d() +
  scale_y_continuous(breaks = c(0, 1, 0.5), 
                     labels = c(0, 1, 0.5)) +
  facet_wrap(~primacy_agency_code) +
  labs(x = "", 
       y = "Proportion of Systems Covered by Each Tier, Split by System Size (Population)",
       subtitle = "Missing Pop indicates recorded population of 0 or 1, usually wholesaler",
       fill = "Tier") +
  theme_minimal(base_size = 6) +
  theme(panel.grid.minor.x = element_blank())
TEMM Tier distribution of systems in Native American territories-by system size

Figure 3.4: TEMM Tier distribution of systems in Native American territories-by system size


A data table of Tier categories across system sizes, nationwide, is provided below. Proportion represents the proportion of Tier within the size class.

Table: TEMM Tier distribution by system size

A data table of the above plots, looking at Tier counts by primacy agency code, is provided below.

Table: TEMM Tier distribution by system size and primacy agency code


3.4 National boundary layer

The national water service boundary layer is too large (around 400 MB) to plot in this interactive report, thus we show a static map below to illustrate the coverage provided by Tiers 1-3 in the proportions described above, and direct the reader to the TEMM spatial layer in the project repository.

include_graphics(here("docs/img/temm-nation.png"))
Nationwide TEMM water system boundary layer

Figure 3.5: Nationwide TEMM water system boundary layer

Here, we zoom into California, Nevada, and Oregon – three states with high proportions of Tier 1, 2, and 3 spatial boundaries respectively. The Tier 3 circular buffers in this interactive map represent median (50th percentile) model estimates. Click on the layers icon to toggle between Tiers.

# plot CA, NV, OR
temm %>% 
  filter(primacy_agency_code %in% c("CA", "NV", "OR"),
         tier != "none") %>% 
  select(pwsid, service_connections_count, tier) %>% 
  mapview::mapview(zcol = "tier", burst = TRUE)

Next, we review the construction of each Tier.


4 Tier 1: Explicit boundaries

Tier 1 boundaries are the most straightforward to understand: states provide explicit data, and when they do, these boundaries tend to cover nearly the entire population served by water systems. Tier 1 data missingness challenges the accuracy of downstream Tier 2 and 3 boundary estimates. At the time of writing, Tier 1 data is present for 14 states (AZ, CA, CT, IL, KS, MO, NC, NJ, NM, OK, PA, TX, UT, WA).

# summarize missing and present states in a dataframe
states_missing <- state.abb[!state.abb %in% unique(wsb_labeled_clean$state)]
states_present <- unique(wsb_labeled_clean$state)

states_df <- tibble(
  state_abbr = c(states_missing, states_present),
  status = c(rep("missing", length(states_missing)), 
             rep("present", length(states_present)))
)

# pull usa state geometries, project to input data CRS
usa <- USAboundaries::us_states(resolution = "low") %>% 
  # `epsg` is the project CRS code, assumed to be defined in the report setup
  st_transform(epsg) %>% 
  left_join(states_df) %>% 
  tigris::shift_geometry() %>% 
  # remove Puerto Rico
  filter(!is.na(status))

# map of missing states
ggplot() +
  geom_sf(data = usa, aes(fill = status), alpha = 0.5, color = "white") +
  scale_fill_carto_d(palette = "Pastel", direction = -1) +
  labs(fill = "") +
  theme_void() + 
  labs(title = "Labeled data coverage by state",
       subtitle = paste("Last updated:", Sys.Date()))
Presence/absence of explicit boundaries in US

Figure 4.1: Presence/absence of explicit boundaries in US

We assume that data provided by states is correct, and do not clean Tier 1 data beyond assimilating the data across states into a single layer.
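A minimal sketch of that assimilation step, stacking per-state labeled layers into one layer with a shared schema (toy in-memory geometries, not the actual state files):

```r
library(sf)

# Sketch: assimilate per-state labeled layers into a single Tier 1 layer.
# Geometries and pwsids below are toy values, not real state data.
sq <- function(x0, y0) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                        c(x0, y0 + 1), c(x0, y0))))
}
ca <- st_sf(pwsid = "CA0000001", state = "CA",
            geometry = st_sfc(sq(0, 0), crs = 4326))
tx <- st_sf(pwsid = "TX0000001", state = "TX",
            geometry = st_sfc(sq(5, 0), crs = 4326))

# no cleaning beyond stacking rows into a shared schema
tier1 <- rbind(ca, tx)
nrow(tier1)  # 2
```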


5 Tier 2: Matched boundaries

A detailed overview of the Tier 2 matching approach is provided in the project Github repository at src/analysis/sandbox/matching/methodology.md, and a brief overview is provided here.

A master list of active, community water systems is derived from SDWIS data. Supporting locational information, such as city served, is joined from supporting SDWIS tables. Federal water facility data (i.e., ECHO, FRS) provide spatial centroids (latitude/longitude of the water system facility), but no federal sources provide explicit spatial boundaries. Although ECHO is a superset of FRS centroids, both data sources are used because together they improve matching. Many centroids in the ECHO database are poor quality because they are merely the centroid of the county or state of the water system. In these cases, where available, zip code centroids are substituted based on the zip codes served reported for select water systems in the UCMR dataset or, for mobile home parks, centroids from the MHP dataset.

We match water system name, city served name, and spatial attributes of these data (centroids) to US Census Place spatial polygons (TIGER shapefiles), which are assumed to represent reasonable proxy water system boundaries. Matched TIGER polygon boundaries thus represent what we call “Tier 2” boundaries. Multiple matches are possible among the different match strategies, and thus a set of steps outlined in the methodology is taken to assign the best match. Best matches are based on rules that are validated against systems with existing labeled boundaries. Many water systems match TIGER places one-to-one (Tier 2a). Still, multiple water systems can match to the same TIGER place; where they do, the proxy boundaries for those systems would perfectly overlap, so we pick a best match and relegate the unselected matched systems to Tier 3. A future step will be to use the Census matching data to better locate the centroids used for Tier 3 modeling.
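A simplified sketch of the name-match signal, constrained within state (hypothetical systems and places; the full logic, including city-served and spatial matches, is in methodology.md):

```r
library(dplyr)

# Sketch (hypothetical rows): match water system names to TIGER Place
# names, constrained within state to prevent out-of-state mismatches.
systems <- tibble(
  pwsid = c("CA0000001", "CA0000002"),
  name  = c("SACRAMENTO", "RIVER CITY MHP"),
  state = c("CA", "CA")
)
places <- tibble(
  geoid = c("0664000", "0622020"),
  place = c("SACRAMENTO", "ELK GROVE"),
  state = c("CA", "CA")
)
name_matches <- inner_join(systems, places,
                           by = c("name" = "place", "state" = "state"))
name_matches$pwsid  # "CA0000001"
```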

The uncertainty of using TIGER polygons as proxies for water system boundaries is evaluated in the next section.


5.1 Uncertainty

We estimate the uncertainty in the Tier 2 matching approach with a Modified Jaccard Similarity, defined as the proportion of area overlap between the TIGER match and the underlying water service boundary. We calculate the Modified Jaccard index \(J\) by comparing Tier 2 matched TIGER geometries to explicit Tier 1 geometries.

\[ J = \frac{A_T \cap A_E }{A_E} \] where \(A_T \cap A_E\) is the area of intersection between the TIGER Place area (\(A_T\)) and the explicit Tier 1 water service boundary area (\(A_E\)). Thus, \(J\) lies in the closed interval between 0 and 1, inclusive: \(\{J \space \space | \space \space 0 \le J \le 1\}\). Importantly, \(A_T\) can fall outside of \(A_E\). The Modified Jaccard similarity \(J\) measures the proportion of coverage that \(A_T\) provides over \(A_E\).
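As a toy numeric example (hypothetical axis-aligned rectangles, base R): if the explicit boundary \(A_E\) is the unit square and the TIGER match \(A_T\) is the same square shifted half a unit to the right, the overlap is half of \(A_E\), so \(J = 0.5\):

```r
# Toy Modified Jaccard for axis-aligned rectangles (hypothetical shapes).
# A rectangle is (xmin, ymin, xmax, ymax).
rect_intersection_area <- function(a, b) {
  w <- max(0, min(a[3], b[3]) - max(a[1], b[1]))
  h <- max(0, min(a[4], b[4]) - max(a[2], b[2]))
  w * h
}
a_e <- c(0, 0, 1, 1)      # explicit Tier 1 boundary (unit square)
a_t <- c(0.5, 0, 1.5, 1)  # TIGER match, shifted half a unit right

area_e <- (a_e[3] - a_e[1]) * (a_e[4] - a_e[2])
j <- rect_intersection_area(a_t, a_e) / area_e
j  # 0.5
```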

# Tier 2 TIGER geometries for explicit boundaries
tiger <- path(staging_path, "tigris_places_clean.geojson") %>% 
  st_read(quiet = TRUE) %>% 
  select(matched_bound_geoid = geoid) %>%
  mutate(matched_bound_geoid = as.integer(matched_bound_geoid))

t2 <- temm %>% 
  filter(tier == "Tier 1" & !is.na(matched_bound_geoid)) %>% 
  st_drop_geometry() %>% 
  select(matched_bound_geoid, pwsid) %>% 
  left_join(tiger, by = "matched_bound_geoid") %>% 
  st_as_sf() %>% 
  arrange(pwsid) %>% 
  # put into metric CRS, ensure valid geom for intersection
  st_transform(3310) %>% 
  st_make_valid() 

# Tier 1: explicit boundaries
t1 <- temm %>% 
  filter(
    tier == "Tier 1" & 
    pwsid %in% t2$pwsid
  ) %>% 
  select(pwsid) %>% 
  arrange(pwsid) %>% 
  # put into metric CRS, ensure valid geom for intersection, calc area
  st_transform(3310) %>% 
  st_make_valid() %>% 
  mutate(area_t1 = st_area(geometry)) 

# sanity check: t1 and t2 vectors are identical and ready to be compared
# identical(t1$pwsid, t2$pwsid)

# calculate intersection of t2 over t1 and calculate area
f_intersection <- function(x){
  res = st_intersection(t2[x, ], t1[x, ])
  if(nrow(res) == 0){
    res = 0
  } else {
    res = res %>% 
      mutate(area = st_area(geometry)) %>% 
      pull(area)
  }
  return(res)
}

# perform the intersection - needs to be pairwise, hence the func above
# t1$area_intersection <- map_dbl(1:nrow(t1), ~f_intersection(.x)) # slow

# calculate modified jacard
# t1$jacard <- as.numeric(t1$area_intersection / t1$area_t1)

# load pre-computed result from above b/c the previous 2 lines take time
# write_rds(t1, path(staging_path, "t1_jacard.rds"))
t1 <- read_rds(path(staging_path, "t1_jacard.rds"))

# sanity check that intersection worked
# j = 6
# mapview(t1[j,]) + 
#   mapview(t2[j,], col.regions = "green") + 
#   mapview(st_intersection(t2[j, ], t1[j, ]), col.regions = "red")
# t1$jacard[j]

We calculate that 95.98% of Tier 2 matched TIGER Place boundaries spatially intersect their assigned explicit Tier 1 boundary (when present). The proportion of overlap (Modified Jaccard) for the 95.98% of intersecting Tier 1 and 2 geometries is generally left-skewed, which indicates high overlap. A Jaccard of 1 indicates complete overlap (i.e., the TIGER Place sufficiently covers the explicit water system boundary), a Jaccard of 0.1 indicates that only 10% of the Tier 1 explicit boundary is covered by the Tier 2 boundary, and a Jaccard of 0 indicates zero overlap (i.e., non-intersection).

t1 %>% 
  filter(jacard != 0) %>% 
  ggplot(aes(jacard)) +
  geom_histogram(color = "white") +
  labs(x = "Modified Jaccard Similarity") +
  theme_report
Histogram of modified Jaccard similarity index

Figure 5.1: Histogram of modified Jaccard similarity index

4.02% of Tier 1 and Tier 2 boundaries do not intersect; however, we show that the vast majority of these geometries are close to one another, which is unsurprising, as matches are constrained to the same state. The distribution of distances between centroids of non-intersecting matches is shown below.

# a small portion (4.02%) of matched tier 2 geoms don't intersect
# tier 1 explicit boundaries but are they close by? calculate centroid distances
t1_zero <- t1 %>% 
  filter(jacard == 0) %>% 
  mutate(geometry = st_centroid(geometry)) %>% 
  arrange(pwsid)

t2_zero <- t2 %>% 
  filter(pwsid %in% t1_zero$pwsid) %>% 
  mutate(geometry = st_centroid(geometry)) %>% 
  arrange(pwsid)

# sanity check: t1 and t2 vectors are identical and ready to be compared
# identical(t1_zero$pwsid, t2_zero$pwsid)

# calculate pairwise distances between Tier 1-2 non-intersecting centroids
# with "zero jacard" (zd)
# output distance is in kilometers: the projected CRS (EPSG:3310) is in meters
# and we divide by 1000
f_distance <- function(x){
  res = st_distance(t1_zero[x, ], t2_zero[x, ])
  return(as.numeric(res)/1000)
}
# zjd <- map_dbl(1:nrow(t1_zero), ~f_distance(.x)) # slow

# load pre-computed result from above because it takes a while to run
# write_rds(zjd, path(staging_path, "zero_dist_non_intersects.rds"))
zjd <- read_rds(path(staging_path, "zero_dist_non_intersects.rds"))

# histogram of distances in miles 
# 1 km = 0.621371 miles
km_to_mi = 0.621371

tibble(dist = zjd) %>% 
  ggplot(aes(dist*km_to_mi)) +
  geom_histogram(color = "white", bins = 200) +
  labs(x = "Distance (mi)") +
  coord_cartesian(xlim = c(0, 500)) + 
  theme_report 
Distance between matched Tier 1 and Tier 2 polygon centroids

Figure 5.2: Distance between matched Tier 1 and Tier 2 polygon centroids

The median distance between matched Tier 1 and Tier 2 geometries is 14.95 mi and the interquartile range is 4.92 - 130.52 mi. This suggests that non-spatially intersecting matches are still relatively close together and hence may be considered adequate preliminary water service boundaries. However, if spatial accuracy finer than several miles is paramount, it may be determined that Tier 3 boundaries should be used instead of Tier 2 boundaries. In the absence of a Tier 1 boundary, where Tier 2 boundaries do not spatially intersect the provided ECHO centroid, we may decide at a later date to prefer the Tier 3 boundary, for instance, if the distance between the ECHO centroid and the matched TIGER Place exceeds some reasonable threshold.

Results suggest that matching water system spatial boundaries and attributes (i.e., name) to TIGER Places is a valid approach to construct “proxy” water system boundaries. These findings are relatively consistent with findings by Patterson et al. (2022), who showed that Census TIGER Places generally agree with municipal boundaries and that these municipal boundaries are valid proxies for water service areas.


6 Tier 3: Modeled boundaries

In the absence of explicit spatial boundaries (Tier 1) and matched TIGER proxy boundaries (Tier 2), we statistically model an estimated boundary (Tier 3). Model specification hinges on the correlation between the radius of Tier 1 convex hulls4 (the response variable), and predictors that explain this response, such as service connection count, population served, ownership type and so on.
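Footnote 4 defines how the scalar radius response is derived from each Tier 1 polygon; one simple convention, shown here purely for illustration (an assumption, not necessarily the report's exact definition), is the maximum distance from the convex hull's centroid to its vertices:

```r
# Illustrative "radius" of a polygon: take the convex hull of its vertices
# (chull() from grDevices, loaded by default) and measure the maximum distance
# from the hull's centroid to any hull vertex.
hull_radius <- function(x, y) {
  h  <- chull(x, y)                 # indices of convex hull vertices
  cx <- mean(x[h]); cy <- mean(y[h])
  max(sqrt((x[h] - cx)^2 + (y[h] - cy)^2))
}

# unit square: hull centroid is (0.5, 0.5), farthest vertices are the corners
hull_radius(c(0, 1, 1, 0), c(0, 0, 1, 1))  # sqrt(0.5), about 0.707
```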

We experimented with 3 different models: random forest, xgboost, and multiple linear regression. These different statistical and machine learning models have different assumptions and performance, which are discussed in the sections that follow.

Ultimately, we selected the multiple linear regression model as our final model because it is computationally efficient, easily interpretable, provides confidence intervals to characterize uncertainty in the modeled boundary (this may be useful depending on the application of the model results), and finally because it avoids overfitting5.


6.1 EDA

In this section of the report, we expand on important data pre-processing steps, feature selection/engineering, and developed models.


6.1.1 Right skewed response

The response variable (radius) from the 14 labeled Tier 1 states (AZ, CA, CT, IL, KS, MO, NC, NJ, NM, OK, PA, TX, UT, WA) is right-skewed. The median radius is 0.58 mi, and the maximum radius is 33.05 mi. Linear models assume constant error variance and normality; thus, we log-transform the response to stabilize its variance, enable regression and inference approaches, prevent high-leverage radii from exerting undue influence on the model fit, and disallow negative predicted radii.

d %>% 
  ggplot(aes(radius*km_to_mi)) + 
  geom_histogram(bins = 100, col = "white") +
  labs(x = "Radius (mi)") +
  theme_report
Distribution of water system radius in Tier 1 systems

Figure 6.1: Distribution of water system radius in Tier 1 systems

The main drawback of log-transformation is that model coefficient units (e.g., the slope of the regression fit) and measures of performance (e.g., RMSE) become more difficult to interpret; however, there are methods to handle this interpretation6. The benefits of log-transformation outweigh the drawbacks, so we log-transform the response variable.
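For intuition on why the interpretation changes, note that predictions and errors on the log10 scale back-transform multiplicatively (hypothetical values below):

```r
# A model fit to log10(radius) predicts on the log10 scale; exponentiating
# recovers radius in the original units (and yields the median, rather than
# the mean, of the prediction distribution).
pred_log10 <- 0.5          # hypothetical model output on the log10 scale
radius_km  <- 10^pred_log10
radius_km                  # ~3.16 km

# an RMSE of 0.35 on the log10 scale is a *multiplicative* error factor:
rmse_log10 <- 0.35
10^rmse_log10              # predictions typically off by a factor of ~2.24
```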

Moreover, to avoid over-fitting a model on a training set that over-represents the more common low-radius observations, we perform a stratified random sample on the response variable. This samples equal proportions of training observations from each quartile of the radius distribution, delineated by the red dashed lines in the figure below.

d %>% 
  # calculate quantiles to show stratified random sampling
  mutate(
    p25 = quantile(radius, 0.25, na.rm = TRUE)*km_to_mi,
    p50 = quantile(radius, 0.50, na.rm = TRUE)*km_to_mi,
    p75 = quantile(radius, 0.75, na.rm = TRUE)*km_to_mi
  ) %>% 
  ggplot(aes(radius*km_to_mi)) + 
  geom_histogram(bins = 100, col= "white") + 
  geom_vline(aes(xintercept = p25), linetype = "dashed", color = "red") +
  geom_vline(aes(xintercept = p50), linetype = "dashed", color = "red") +
  geom_vline(aes(xintercept = p75), linetype = "dashed", color = "red") +
  scale_x_log10() +
  labs(x = "Radius (mi)") +
  theme_report
Distribution of system radii with quartiles

Figure 6.2: Distribution of system radii with quartiles
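The quartile-stratified split described above can be sketched in base R, independently of the rsample implementation used later in the Models section:

```r
set.seed(55)

# hypothetical right-skewed radii (stand-in for the Tier 1 response)
radius <- rlnorm(100, meanlog = -0.5, sdlog = 1)

# bin the response into quartiles, then draw 80% *within each bin* so the
# training set preserves the shape of the radius distribution
bins <- cut(radius,
            breaks = quantile(radius, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)
train_idx <- unlist(lapply(split(seq_along(radius), bins),
                           function(i) sample(i, round(0.8 * length(i)))))

table(bins[train_idx])  # 20 observations drawn from each of the 4 quartile bins
```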


6.1.2 Correlated predictors

Population served and service connections are highly correlated (\(r\) \(=\) 0.84), and the relationship holds across categories such as owner type code. Both also exhibit log-normality; thus, like the response variable, they are log-transformed, which further affects the interpretation of the regression fit7. Moreover, these two correlated predictors should not be used together in a linear model (to avoid multicollinearity, which reduces the precision of estimated model coefficients), so we use only one, and separately try a model that multiplies the two predictors together to see if combining them improves the fit (albeit at the loss of some interpretability). We can, however, use both variables in tree-based methods.

d %>% 
  ggplot(aes(population_served_count, service_connections_count)) + 
  geom_point(aes(color = owner_type_code), alpha = 0.2) +
  rcartocolor::scale_color_carto_d() +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Population served", 
       y = "Service connections",
       color = "Owner Type Code") +
  theme_report
Service connections vs. population served

Figure 6.3: Service connections vs. population served

The same relationship, faceted by owner type code below, shows that this correlation persists across ownership types. Importantly, although we lack Tier 1 data for Native American systems (owner type code = “N”), the population and service connection scaling is present for these systems, so models trained on other Tier 1 ownership types are likely to generalize well to them.

d %>% 
  ggplot(aes(population_served_count, service_connections_count)) + 
  geom_point(aes(color = owner_type_code), alpha = 0.2) +
  rcartocolor::scale_color_carto_d() +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~owner_type_code) + 
  labs(x = "Population served", 
       y = "Service connections",
       color = "Owner Type Code") +
  guides(color = "none") +
  theme_report
Service connections vs. population served, faceted

Figure 6.4: Service connections vs. population served, faceted
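The correlation above is computed between the log-transformed counts. A self-contained sketch with synthetic (hypothetical) data shows the same pattern:

```r
set.seed(1)

# synthetic systems: connections scale roughly linearly with population,
# with multiplicative noise (both variables lognormal-ish, as in the data)
population  <- rlnorm(500, meanlog = 8, sdlog = 1.5)
connections <- population / 2.5 * rlnorm(500, meanlog = 0, sdlog = 0.3)

# Pearson correlation on the log10 scale, as in the EDA above
r <- cor(log10(population), log10(connections))
r > 0.9  # strong positive correlation on the log scale
```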


6.1.3 Interaction effects

Linear regression uses correlations between features of a dataset (called predictors) to predict an outcome (i.e., the response variable). Standard regression inference assumes predictors are independent of one another, which is to say, they do not share substantial mutual information. When predictors instead interact, meaning the effect of one predictor depends on the value of another, the interaction should be specified in the model; otherwise results are less reliable and standard errors are inflated.

Interactions between variables can be evaluated by plotting model predictors against one another and looking for differences in regression slopes.
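In R's formula interface, an interaction is declared with `*` (main effects plus interaction) or `:` (interaction only). A sketch with synthetic data illustrates how an interaction term captures a slope that differs by group, the situation the plots below screen for:

```r
set.seed(2)

# synthetic data where the slope of y on x differs by group
x <- runif(200, 0, 10)
g <- factor(rep(c("L", "P"), each = 100))
y <- ifelse(g == "L", 2 * x, 0.5 * x) + rnorm(200, sd = 0.5)

fit_no_int <- lm(y ~ x + g)  # forces a common slope across groups
fit_int    <- lm(y ~ x * g)  # expands to x + g + x:g; slope varies by group

# the interaction coefficient recovers the slope difference (about -1.5 for P)
coef(fit_int)[["x:gP"]]
```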


6.1.3.1 Owner type code

Regression slopes for service connections differ by owner type code. This suggests an interaction term between service connections (and, by correlation, population) and owner type. Most systems are owned by local government (L) or private (P) entities. Moreover, Native American (N) systems are poorly represented in Tier 1 data; this is discussed in the Limitations and Recommendations sections and should be addressed in future work8.

d %>% 
  # remove 2 "N" codes that are not meaningful
  filter(owner_type_code != "N") %>% 
  ggplot(aes(service_connections_count, radius*km_to_mi)) + 
  geom_point(alpha = 0.2) + 
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~owner_type_code) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Service Connections", 
       y = "Radius (mi)") +
  theme_report
Interaction plot: owner type code vs. service connections

Figure 6.5: Interaction plot: owner type code vs. service connections


6.1.3.2 Wholesaler

As before with owner type code, wholesaler status interacts with service connections, thus we include an interaction effect for it in the linear model.

d %>% 
  ggplot(aes(service_connections_count, radius*km_to_mi)) + 
  geom_point(alpha = 0.2) + 
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~is_wholesaler_ind) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Service Connections", 
       y = "Radius (mi)") +
  theme_report
Interaction plot: wholesaler vs. service connections

Figure 6.6: Interaction plot: wholesaler vs. service connections


6.1.3.3 Service area type code

Service area type codes enumerate all possible service areas of a water system, e.g., a daycare, a mobile home park, or a residential area. Because there are many permutations and possibilities, we select the top 4 service area type codes as features for the model, lump everything else into an “Other” category, and train on this feature. This reduces class imbalance across the categories. We include an interaction effect for service area type code.

# clean service area type code column
d <- d %>% 
  mutate(
    # split type codes in the "python list" into chr vectors
    satc = strsplit(service_area_type_code, ", "),
    # map over the list to remove brackets ([]) and quotes (')
    satc = map(satc, ~str_remove_all(.x, "\\[|\\]|'")),
    # sort the resulting chr vector
    satc = map(satc, ~sort(.x)), 
    # collapse the sorted chr vector
    satc = map_chr(satc, ~paste(.x, collapse = ""))
  ) 

After cleaning, there are 708 unique service area type code combinations; most are “RA” or “MH”, indicating that most water systems serve residential areas or areas that include mobile home parks. Regression slopes appear similar but with different y-intercepts, which suggests that some service area types are larger at an equal service connection count. We may clean this further in a future iteration if domain knowledge indicates predictive power, although preliminary visual inspection (below) suggests little explanatory power.

# plot service connections v radius and facet by top 3 service area type code
d %>% 
  mutate(
    satc = fct_lump_n(satc, 3), 
    satc = as.character(satc), 
    satc = ifelse(is.na(satc), "Other", satc)
  ) %>%
  ggplot(aes(service_connections_count, radius*km_to_mi)) + 
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE) + 
  facet_wrap(~satc) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Service connections", y = "Radius (mi)") +
  theme_report
Interaction plot: service area type code vs. service connections

Figure 6.7: Interaction plot: service area type code vs. service connections


6.2 Models

We fit random forest, xgboost, and multiple linear regression models to a training set of 80% of the data selected by stratified random sampling to avoid overfitting to low-radius observations. We then evaluate model performance on the 20% holdout test set, discuss model uncertainty and limitations, interpret model results, and use the model to generate Tier 3 predictions on unlabeled data not covered by Tier 1 or 2 boundaries.


6.2.1 Random Forest

We first fit a random forest to the Tier 1 labeled data because the algorithm makes weaker assumptions than regression, handles correlated predictors, and produces variable importance output that quickly indicates which variables explain the most variance in the dataset. This can confirm our understanding of the relationships explored in the EDA above and inform how to specify subsequent, more interpretable linear models.

We fit the random forest regression to predict radius from service connections, population, owner type, service area type code, and wholesaler status. Service connections, population served, and owner type code appear important in segmenting the feature space, while service area type code and wholesaler status appear less predictive.

library(tidymodels)
library(vip)

# read full dataset 
df <- read_csv(path(staging_path, "model_input_clean.csv")) 

# unlabeled data (du) and labeled data (dl)
du <- df %>% filter(is.na(radius))
dl <- df %>% filter(!is.na(radius))

# split labeled data (dl) into train and test with stratified random sampling
# in each of the radius quartiles to account for the lognormal distribution
# of the response variable (radius) and avoid overfitting to small radius obs
set.seed(55)
dl_split <- initial_split(dl, prop = 0.8, strata = radius)
train    <- training(dl_split) 
test     <- testing(dl_split)

# model and workflow
rf_mod <- 
  rand_forest(trees = 1000) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

rf_wflow <- 
  workflow() %>% 
  add_formula(
    radius ~ 
      population_served_count + 
      # importantly, the RF can have correlated predictors, so we add
      # service connections, and don't need to account for interactions
      service_connections_count + 
      # use the cleaned owner type code from preprocess.R, which converts 
      # 2 "N" owner type codes to "M" so that models can evaluate
      owner_type_code_clean +  
      is_wholesaler_ind + 
      satc
  ) %>% 
  add_model(rf_mod) 

# fit the random forest model
rf_fit <- fit(rf_wflow, train)

# show variable importance
rf_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "point") +
  theme_report
Random forest variable importance

Figure 6.8: Random forest variable importance

Next we evaluate the fitted model on the test set, plot residuals, and calculate an \(R^2\) goodness-of-fit metric and two standard error metrics, RMSE and MAE. For our purposes, MAE is likely more important because it penalizes outliers less than RMSE, and we care more about the overall appropriateness of the boundary than about preventing large outlying errors.
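The contrast between the two metrics is easy to see on a toy residual vector, where a single outlier inflates RMSE far more than MAE:

```r
# toy absolute residuals: three small errors and one large outlier
errors <- c(1, 1, 1, 10)

mae  <- mean(abs(errors))      # (1 + 1 + 1 + 10) / 4 = 3.25
rmse <- sqrt(mean(errors^2))   # sqrt((1 + 1 + 1 + 100) / 4), about 5.07

# RMSE is pulled toward the outlier; MAE reflects the typical error
c(mae = mae, rmse = rmse)
```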

# predict on test set
rf_test_res <- test %>% 
  # select(radius) %>% 
  bind_cols(predict(rf_fit, test)) 

# plot residuals
rf_test_res %>% 
  ggplot(aes(log10(radius*km_to_mi), log10(.pred*km_to_mi), color = owner_type_code)) + 
  geom_point(alpha = 0.4) + 
  geom_abline(lty = 2, color = "red") + 
  labs(y = "Predicted radius (log10)", x = "Radius (log10)",
       color = "Owner Type") +
  # scale and size the x- and y-axis uniformly
  coord_obs_pred() +
  theme_report
Random forest test residuals, highlighted by ownership type

Figure 6.9: Random forest test residuals, highlighted by ownership type

# calculate goodness of fit and error metrics
rf_metrics <- metric_set(rsq, rmse, mae)
rf_metrics(rf_test_res, truth = log10(radius), estimate = log10(.pred)) %>% 
  select(-.estimator) %>% 
  knitr::kable(col.names = c("Metric", "Estimate")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Metric Estimate
rsq 0.6365426
rmse 0.3796572
mae 0.3030336


6.2.2 xgboost

Like random forest, xgboost is a tree-based method that is relatively easy to fit and unconstrained by linear model assumptions. Variable importance metrics were consistent with those examined in the random forest model. We tuned xgboost hyperparameters across a grid of inputs (see the plot below, used to select the best xgboost model) and found that, on the holdout test set, it performed no better than the random forest in terms of goodness of fit and error. Surprising at first, this result makes sense given the flexibility of the xgboost model, the lack of the larger dataset that a more flexible model would benefit from, and the underlying functional form of the relationship we aim to model.

# load hyperparameter tune space
xgb_res <- read_rds(here("src/analysis/sandbox/model_explore/etc/xgb_res.rds"))

# visualize model performance across tuning grid
xgb_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  select(mean, min_n:learn_rate) %>%
  pivot_longer(min_n:learn_rate,
               values_to = "value",
               names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rsq") +
  theme_report
XGBoost model performance across tuning grid

Figure 6.10: XGBoost model performance across tuning grid

xgboost is a flexible machine learning model: it tends to outperform random forest when there is sufficient data to segment the feature space and learn from residuals. In our case, however, the data are modest, only a few tens of thousands of water systems with only 3 high-importance predictors, 2 of which are highly correlated. Thus the similar performance of xgboost and random forest may be explained by xgboost being unable to improve on random forest while slightly overfitting to the training set, which a less flexible model like linear regression will not suffer from. In statistical terms, low-flexibility models also tend to maintain low variance, meaning they perform similarly across new test sets. Finally, because the underlying functional form of the relationship being modeled is strongly linear (see the section on Interaction Effects), there is not a great advantage in straying from linear methods.

# load the best fit model
final_xgb <- read_rds(here("src/analysis/sandbox/model_explore/etc/final_xgb.rds"))

# fit the final xgboost model on training data
xgb_fit <- fit(final_xgb, train)

# show variable importance
xgb_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "point") +
  theme_report
XGBoost variable importance

Figure 6.11: XGBoost variable importance

# predict on test set
xgb_test_res <- test %>% 
  select(radius) %>% 
  bind_cols(predict(xgb_fit, test)) 

# plot residuals
xgb_test_res %>% 
  ggplot(aes(log10(radius*km_to_mi), log10(.pred*km_to_mi))) + 
  geom_point(alpha = 0.4) + 
  geom_abline(lty = 2, color = "red") + 
  labs(y = "Predicted radius (log10)", x = "Radius (log10)") +
  # scale and size the x- and y-axis uniformly
  coord_obs_pred() +
  theme_report
XGBoost test residuals

Figure 6.12: XGBoost test residuals

# calculate goodness of fit and error metrics
xgb_metrics <- metric_set(rsq, rmse, mae)
xgb_metrics(xgb_test_res, truth = log10(radius), estimate = log10(.pred)) %>% 
  select(-.estimator) %>% 
  knitr::kable(col.names = c("Metric", "Estimate")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Metric Estimate
rsq 0.6341354
rmse 0.3788001
mae 0.3034620


6.2.3 Linear Regression

As explained in the review of random forest and xgboost models above, there is sound rationale for linear regression in the context of this problem. A strong (and intuitive) linear relationship is observed between Tier 1 water system radii and service connections. The linear model fit outperforms other models, is easily interpretable, and provides standard error metrics. The underlying data do not have a high number of observations that machine learning models would benefit from, and the parameter space is also not high in dimensionality. For these reasons, we select the linear model to estimate Tier 3 boundaries.

Unlike the random forest and xgboost models, we are careful to log-transform the log-normal predictors and response, and to specify interaction terms where present. We experimented with different model specifications, including one that combined the correlated population and service connection variables; however, this led to negligible improvement in model fit and less interpretable coefficients, so the final specification regresses only on service connections. Interaction terms are added for owner type, service area type code, and wholesaler status. A simple linear regression on service connections alone has an \(R^2\) of 0.56; including these extra terms substantially improves the model fit (\(R^2 = 0.64\)) and reduces test error. Note that this \(R^2\) decreased slightly (from 0.66) with the addition of new labeled data, which changed the model parameters.

# re-read dataset and log transform the response - only for linear model
df <- read_csv(path(staging_path, "model_input_clean.csv")) %>% 
  mutate(radius  = log10(radius),
         # multiply correlated predictors - however, this has negligible 
         # impact on model fit and makes it harder to interpret, so ignore
         density = population_served_count * service_connections_count)

# unlabeled data (du) and labeled data (dl)
du <- df %>% filter(is.na(radius))
dl <- df %>% filter(!is.na(radius))

# split labeled data (dl) into train and test with stratified random sampling
# in each of the radius quartiles to account for the lognormal distribution
# of the response variable (radius) and avoid overfitting to small radius obs
set.seed(55)
dl_split <- initial_split(dl, prop = 0.8, strata = radius)
train    <- training(dl_split) 
test     <- testing(dl_split)

# lm recipe
lm_recipe <- 
  # specify the model - interaction terms come later
  recipe(
    radius ~ 
      service_connections_count + 
      # use the cleaned owner type code from preprocess.R, which converts 
      # 2 "N" owner type codes to "M" so that models can evaluate
      owner_type_code_clean + 
      satc +
      is_wholesaler_ind,
    data = train
  ) %>% 
  # convert predictors to log10
  step_log(service_connections_count, base = 10) %>% 
  # encode categorical variables  
  step_dummy(all_nominal_predictors()) %>% 
  # specify interaction effects
  step_interact(~service_connections_count:starts_with("owner_type_code")) %>%
  step_interact(~service_connections_count:starts_with("satc")) %>%
  step_interact(~service_connections_count:starts_with("is_wholesaler_ind"))

# specify model and engine for linear model and rf
lm_mod <- linear_reg() %>% set_engine("lm")

# lm workflow
lm_wflow <- 
  workflow() %>% 
  add_model(lm_mod) %>% 
  add_recipe(lm_recipe)

# fit the linear model on the training set
lm_fit <- fit(lm_wflow, train)

# predict on the test set and bind mean predictions and CIs
lm_test_res <- test %>% 
  select(radius) %>% 
  bind_cols(predict(lm_fit, test)) %>% 
  bind_cols(predict(lm_fit, test, type = "conf_int"))

# plot residuals
lm_test_res %>% 
  # radius and .pred are already on the log10 scale here, so no unit
  # conversion is applied (multiplying log values by km_to_mi would be wrong)
  ggplot(aes(radius, .pred)) + 
  geom_point(alpha = 0.4) + 
  geom_abline(lty = 2, color = "red") + 
  labs(y = "Predicted radius (log10)", x = "Radius (log10)") +
  # scale and size the x- and y-axis uniformly
  coord_obs_pred() +
  theme_report
Linear model residuals

Figure 6.13: Linear model residuals

On the test set, linear model goodness of fit is comparable to random forest, but error metrics are notably lower.

# calculate goodness of fit and error metrics
lm_metrics <- metric_set(rsq, rmse, mae)
lm_metrics(lm_test_res, truth = radius, estimate = .pred) %>% 
  select(-.estimator) %>% 
  knitr::kable(col.names = c("Metric", "Estimate")) %>% 
  kableExtra::kable_styling(full_width = FALSE)
Metric Estimate
rsq 0.6409469
rmse 0.3484414
mae 0.2655540


6.2.3.1 Spatial residuals

We examine model residuals across space (i.e., Tier 3 predictions - Tier 1 explicit labels) to test for spatial autocorrelation in the model residuals. Although we do not calculate standard autocorrelation metrics, a high-level examination of the residuals across states shows that Tier 3 estimates are generally consistent with observed radii (most residuals are near 0 mi - indicating strong agreement and consistent with the 1:1 predicted vs observed plot above). Notable exceptions include MO and OK, which show systematic under-estimation (negative values indicate a smaller estimated radius compared to the observed radius) on the order of around 3 miles.

# Tier 3: Modeled boundaries: read and construct geometry
t3 <- path(staging_path, "tier3_median.geojson") %>% st_read(quiet = TRUE)

# examine residuals
t3 <- t3 %>% 
  filter(!is.na(radius)) %>% 
  mutate(r = .pred - radius)

# plot
t3 %>% 
  mutate(state = str_sub(pwsid, 1, 2)) %>% 
  ggplot() +
  geom_vline(xintercept = 0, color = "black", linetype = "dashed") +
  geom_line(aes(r*km_to_mi, color = state), stat = "density") + 
  rcartocolor::scale_color_carto_d() +
  facet_wrap(~state, scales = "free_y") +
  guides(color = "none") +
  labs(x = "Residual (mi)", y = "Density") +
  theme_report
Spatial residuals (Tier 1 states)

Figure 6.14: Spatial residuals (Tier 1 states)


6.2.3.2 Confidence intervals

Linear model 5% and 95% confidence intervals are calculated and provided as an output alongside the median radius. Importantly, the range of these predictions is relatively narrow, generally less than 0.6 mi.

t <- df %>% 
  select(pwsid, radius) %>% 
  bind_cols(predict(lm_fit, df)) %>% 
  bind_cols(predict(lm_fit, df, type = "conf_int", level = 0.95)) %>% 
  mutate(
    # exponentiate results back to median (unbiased), and 5/95 CIs
    across(where(is.numeric), ~10^(.x)),
    # calculate CI range to plot distribution
    .pred_range = .pred_upper - .pred_lower
  ) %>% 
  ggplot(aes((.pred_range)*km_to_mi)) +
  geom_histogram(bins = 100, color = "white") +
  #coord_cartesian(xlim = c(0, 2)) +
  labs(x = "CI range (mi)") +
  theme_report

Below is an example of one Tier 3 boundary with a typical CI range (around 0.6 mi). Toggle to the Esri.WorldImagery basemap to view satellite imagery of the underlying location and compare it to the estimated range of the Tier 3 boundary.

# lm CI layers
t3m_med <- st_read(path(staging_path, "tier3_median.geojson"), quiet = TRUE)
t3m_ciu <- st_read(path(staging_path, "tier3_ci_upper_95.geojson"), quiet = TRUE)
t3m_cil <- st_read(path(staging_path, "tier3_ci_lower_05.geojson"), quiet = TRUE)

# plot examples
pwsid_select <- "IN5289012" # ~1km CI range
mapview(filter(t3m_ciu, pwsid == pwsid_select), 
        col.regions = "green", layer.name = "upper CI", 
         alpha.regions = 0.3) +
mapview(filter(t3m_med, pwsid == pwsid_select), 
        col.regions = "blue", layer.name = "median",
        alpha.regions = 0.3) +
  mapview(filter(t3m_cil, pwsid == pwsid_select), 
          col.regions = "red", layer.name = "lower CI",
          alpha.regions = 0.3) 


6.2.3.3 Model Coefficients

In the linear model specified above, when both the dependent (response) variable and an independent (predictor) variable are log-transformed, we interpret the coefficient as the percent change in the response for every 1% increase in that predictor. In our linear model, service connections is the log-transformed predictor, thus we interpret the service_connections_count coefficient (0.37) as follows: for every 1% increase in service connection count, we expect a 0.37% increase in the water system radius. Thus, a 25%, 50%, and 75% increase in service connection count is associated with approximately a 9%, 16%, and 23% increase in radius, respectively.
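These percentages can be verified with a line of arithmetic: in a log10-log10 fit, scaling the predictor by a factor (1 + p) scales the response by (1 + p) raised to the coefficient. A quick numerical check, using the service_connections_count coefficient from the table below:

```r
# back-transformed effect size for a log10-log10 regression coefficient:
# multiplying the predictor by (1 + p) multiplies the response by (1 + p)^beta
beta <- 0.3714  # service_connections_count coefficient (from the table below)

pct_change_radius <- function(p, beta) (1 + p)^beta - 1

# 25%, 50%, and 75% more service connections imply roughly 9%, 16%, and 23%
# larger radii under the fitted elasticity
round(100 * pct_change_radius(c(0.25, 0.50, 0.75), beta), 1)
```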

It is also worth noting the high prevalence of large p-values, which indicate insufficient evidence that the corresponding terms have an effect on the response variable9. Most interaction term coefficients fall into this category. By contrast, low p-value coefficients identify predictors whose changes are associated with detectable changes in the response variable.

lm_fit %>% 
  extract_fit_engine() %>% 
  tidy() %>% 
  knitr::kable(caption = "Linear model coefficients for water system radii") %>% 
  kableExtra::kable_styling(full_width = FALSE)  
Table 6.1: Linear model coefficients for water system radii
term estimate std.error statistic p.value
(Intercept) 1.7681553 0.1517673 11.6504391 0.0000000
service_connections_count 0.3714172 0.0639681 5.8062867 0.0000000
owner_type_code_clean_L 0.1540769 0.1407301 1.0948398 0.2736074
owner_type_code_clean_M -0.2706074 0.1623434 -1.6668830 0.0955621
owner_type_code_clean_P -0.3706222 0.1405468 -2.6370022 0.0083743
owner_type_code_clean_S 0.3463554 0.1818441 1.9046837 0.0568435
satc_MU 0.0931241 0.0785126 1.1861037 0.2356033
satc_Other 0.1323838 0.0646641 2.0472522 0.0406536
satc_RA 0.2857419 0.0596160 4.7930440 0.0000017
is_wholesaler_ind_wholesaler 0.2269642 0.0462001 4.9126293 0.0000009
service_connections_count_x_owner_type_code_clean_L -0.0553213 0.0548959 -1.0077496 0.3135937
service_connections_count_x_owner_type_code_clean_M -0.0033984 0.0670256 -0.0507027 0.9595632
service_connections_count_x_owner_type_code_clean_P 0.0635170 0.0551061 1.1526307 0.2490835
service_connections_count_x_owner_type_code_clean_S -0.2002167 0.0725094 -2.7612504 0.0057662
service_connections_count_x_satc_MU 0.0748175 0.0374844 1.9959644 0.0459589
service_connections_count_x_satc_Other 0.1052148 0.0346646 3.0352255 0.0024084
service_connections_count_x_satc_RA 0.0663033 0.0333759 1.9865620 0.0469922
service_connections_count_x_is_wholesaler_ind_wholesaler -0.0787670 0.0148820 -5.2927573 0.0000001


7 Combine Tiers

After boundaries from all three Tiers are assimilated, matched, or modeled, they are combined into a final spatial data layer. Tier 1 boundaries always supersede Tier 2 boundaries, which always supersede Tier 3 boundaries, as depicted in the conceptual model in the Introduction. The combined TEMM data layer fulfills the motivation of this work, thus we characterize it in the Key Results.


8 Limitations

No model is without limitations. In this section, we highlight key limitations and identify potential future pathways to improve the model fit through additional data collection and/or cleaning.


8.1 The “Pancake Problem”

# filter for pancake stacks
pancakes <- a %>% 
  left_join(a %>%
              select(pwsid, tier, county_served)) %>%
  filter(centroid_quality == "COUNTY CENTROID" & 
         tier == 'Tier 3') 

# counties of big pancakes
big_pancakes <- pancakes %>% 
  count(county_served, primacy_agency_code, sort = TRUE) 

# plot
big_pancake_pwsid <- pancakes %>% 
  filter(county_served == big_pancakes$county_served[1] &
           primacy_agency_code == big_pancakes$primacy_agency_code[1]) %>% 
  pull(pwsid)

The “pancake problem” is fundamentally caused by low centroid accuracy (e.g., county or state centroid accuracy), and impacts PWSIDs that lack an explicit boundary (Tier 1) or TIGER match (Tier 2). Estimated radii for these systems result in circular buffers that overlap, like a stack of pancakes. To illustrate, we query for the biggest pancake problem (greatest number of pancakes in a single stack), which occurs in Wake County, NC and includes 206 Tier 3 predictions.
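The geometry behind a pancake stack is simple: systems assigned the same county centroid receive concentric circular buffers, so the smaller buffer of any pair is entirely contained in the larger. A base-R sketch with hypothetical radii (no sf dependency):

```r
# concentric buffers: the smaller circle is fully contained in the larger, so
# pairwise overlap as a fraction of the *larger* buffer is (r_small / r_big)^2
radii_km <- c(1.2, 2.5, 4.0)  # hypothetical modeled radii at one county centroid

overlap_of_larger <- function(r1, r2) (min(r1, r2) / max(r1, r2))^2

# each pair overlaps 100% of the smaller buffer, and this much of the larger:
combn(radii_km, 2, function(r) round(overlap_of_larger(r[1], r[2]), 2))
```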

# visualize
temm %>% 
  filter(pwsid %in% big_pancake_pwsid) %>% 
  mapview(zcol = "pwsid")
# calculate count and proportion of pancakes
nrow_tier3 <- temm %>% filter(tier == "Tier 3") %>% nrow()

pancake_count_prop <- temm %>% 
  st_drop_geometry() %>% 
  filter(tier == "Tier 3") %>%
  group_by(centroid_quality) %>% 
  summarise(
    count = n(),
    percent_tier3 = round((count/nrow_tier3), 3)*100
  ) %>% 
  arrange(desc(percent_tier3)) 

Next, we quantify the count and percent of “geometry quality” values for all Tier 3 spatial boundaries. State centroid “geometry quality” (1.3% of Tier 3 data) is an extreme example of the centroid accuracy issues observed in the dataset. County centroids (41.9% of Tier 3 data) are better resolved than state-level centroids and are the most common scale at which pancake stacks form. Zip code centroids (21.3% of Tier 3 data) form a large proportion of Tier 3 data, but pancakes at this scale are less inaccurate than county or state centroids and may be acceptable error for most analytical purposes. Finally, not all “geometry quality” values are associated with pancakes (for example, “GPS - UNSPECIFIED” points, when plotted, clearly do not form pancake stacks), but these make up a small proportion of Tier 3 data. Most Tier 3 systems suffer from the pancake problem.

pancake_count_prop %>% 
  knitr::kable(col.names = c("Geometry Quality", 
                             "Count", 
                             "Percent"), 
               align = rep("l", 3),
               caption = "Geometry quality of centroids for systems with Tier 3 boundaries") %>% 
  kableExtra::kable_styling(full_width = FALSE)  
Table 8.1: Geometry quality of centroids for systems with Tier 3 boundaries
Geometry Quality Count Percent
COUNTY CENTROID 7349 41.9
ZIP CODE CENTROID 3730 21.3
ADDRESS MATCHING-HOUSE NUMBER 2994 17.1
INTERPOLATION-PHOTO 1210 6.9
IMAGERY ONLY 827 4.7
BOTH INTERNET AND IMAGERY 350 2.0
PLACE NAME CENTROID 248 1.4
STATE CENTROID 228 1.3
NA 185 1.1
GPS - UNSPECIFIED 93 0.5
INTERPOLATION-MAP 66 0.4
ADDRESS MATCHING-BLOCK FACE 49 0.3
POST OFFICE NAME CENTROID 56 0.3
GPS CODE (PSEUDO RANGE) DIFFERENTIAL 35 0.2
ADDRESS MATCHING-NEAREST INTERSECTION 12 0.1
GDT-ADDRESS MATCHING (GEOCODING) 9 0.1
GPS CARRIER PHASE STATIC RELATIVE POSITION 14 0.1
UNKNOWN 24 0.1
ADDRESS MATCHING-OTHER 1 0.0
ADDRESS MATCHING-STREET CENTERLINE 6 0.0
CLASSICAL SURVEYING TECHNIQUES 3 0.0
GPS CODE (PSEUDO RANGE) PRECISE POSITION 3 0.0
GPS CODE (PSEUDO RANGE) STANDARD POSITION (SA OFF) 1 0.0
INTERPOLATION - DIGITAL MAP SRCE (TIGER) 3 0.0
INTERPOLATION-OTHER 6 0.0
INTERPOLATION-SATELLITE 7 0.0
PUBLIC LAND SURVEY-SECTION 1 0.0
THE GEOGRAPHIC COORDINATE DETERMINATION METHOD BASED ON ADDRESS MATCHING 2 0.0
THE GEOGRAPHIC COORDINATE DETERMINATION METHOD BASED ON GPS 3 0.0
THE GEOGRAPHIC COORDINATE DETERMINATION METHOD BASED ON INTERPOLATION 2 0.0
THE GEOGRAPHIC COORDINATE DETERMINATION METHOD BASED ON PUBLIC LAND SURVEY 1 0.0
US BUREAU OF CENSUS BLOCK ESTABLISHED FOR YEAR NOTED 1 0.0

Upstream centroid accuracy issues are beyond the scope of this study, and may be addressed in coordination with federal agencies that produce FRS and ECHO databases. A recommendation for how to address the pancake problem is outlined in the Recommendations section below.


8.2 Native American data gaps

We lack labeled Tier 1 data, and hence labeled radii, for tribal primacy types (territories are excluded from this analysis), and we cannot improve the model in this respect until such data are collected. The model therefore has no information on “Tribal” primacy types, and all inferences are based on “State” primacy types. This may introduce error into modeled boundaries if system boundaries systematically vary with primacy type.

d %>% 
  filter(tier == "Tier 1") %>%   # filter for Tier 1 systems
  group_by(primacy_type) %>% 
  summarise(proportion_present = round(sum(!is.na(radius)) / n(), 2)) %>% 
  knitr::kable(col.names = c("Primacy Type", "Proportion Present"),
               caption = "Proportion of primacy type for water systems in Tier 1") %>% 
  kableExtra::kable_styling(full_width = FALSE)
Table 8.2: Proportion of primacy type for water systems in Tier 1
Primacy Type Proportion Present
State 1


9 Recommendations

Beyond a discussion of how the TEMM spatial layer may be used (covered in the next section, Contributions), we advise the following improvements to the pipeline for creating the TEMM spatial layer, in order of importance:

  1. Data collection and refinement
    • Collect additional Tier 1 data. Explicit boundary data supersedes Tier 2 TIGER proxy matches and Tier 3 modeled water system boundaries. Moreover, additional Tier 1 data improves Tier 3 model predictions by providing more training data for systems that lack an explicit boundary or a matching TIGER place. In theory, explicit boundary data for every state in the USA would supersede all Tier 2 and 3 matches and estimates, and reduce the pipeline presented herein to a simple data cleaning and assimilation task. Buffering state water line data to approximate water systems is another option for states, such as Kentucky, that hold this kind of data.
    • Gather labeled boundaries for tribal systems, which are currently underrepresented (and for territories, if coverage is to be extended to those regions). Coordination should target the collection of Tier 1 data for these systems.
    • Gather additional data to support technical efforts. In particular, centroids of cities and towns and state-provided municipal boundaries may be used to address the pancake problem and improve the matching algorithm, respectively (described below). Another area for data improvement is to fill in data gaps around population or connections served that result in the loss of systems with missing data.
  2. Data pipeline and methodological improvements
    • Address the “Pancake Problem”. Low-accuracy centroids challenge the usefulness of the TEMM layer and introduce uncertainty into analyses that depend on these data. An approach to re-center low-accuracy centroids (i.e., state and county level centroid accuracy) may be to “spatially dis-aggregate” or “spread out” the pancakes to more reasonable locations, like the centroids of cities and towns that fall within the county/state (depending on the centroid accuracy). Candidate cities for pancakes may be identified via a name match.
    • Improve “Match” and “Model” Tiers 2-3. Improvements may be made to Tier 2 matching and Tier 3 modeling. Specifically, matching may be improved by using state-provided municipal boundaries, which Patterson et al. (2022) showed improve upon Census Place boundaries. Addressing multiple systems matching to the same Census Place is another area for improvement, to reduce the number of systems with overlapping boundaries. A further match rule might be incorporated to assign only one system to the full TIGER boundary and apply the Tier 3 model to the remaining systems.
    • Develop rules to handle boundary assignment for county-level systems, private systems, or water districts, for which Tier 2 boundaries may be less appropriate according to Patterson et al. (2022); such efforts would require substantial work to first categorize the system types, since no database disaggregating public or private systems by more nuanced types readily exists.
    • Create tools to visualize matches. A preliminary, minimal interactive Shiny App was developed to visualize candidate matches and support Tier 2 algorithm development. This tool may be expanded to further support iterative review of changes to the algorithm and better understand how match rules determine the end result.
    • Web hosting of the TEMM spatial layer. The developed TEMM spatial layer will be made available online as a flat file (i.e., .geojson, .shp, .csv) which is accessible to analysts. To make the results accessible to non-technical people and the general public, one option is to stand it up on a server for interactive use in a web browser.
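The re-centering idea in the “Pancake Problem” bullet above might be sketched as a name match between systems and place centroids. The `places` table and every column name below are hypothetical, introduced only to make the idea concrete:

```r
library(dplyr)
library(stringr)

# Hypothetical sketch of "spreading out" county-centroid pancakes. Assumes
# `pancakes` carries pws_name, county_served, lat, lon, and that `places`
# holds TIGER place names with their centroids (all names are assumptions).
respread <- pancakes %>%
  mutate(name_key = str_to_lower(
    str_remove(pws_name, regex(" water.*$", ignore_case = TRUE))
  )) %>%
  left_join(
    places %>% mutate(name_key = str_to_lower(place_name)),
    by = c("county_served" = "place_county", "name_key")
  ) %>%
  # use the matched place centroid where available;
  # otherwise keep the original county centroid
  mutate(lat = coalesce(place_lat, lat),
         lon = coalesce(place_lon, lon))
```

Restricting the join to places within the system's served county keeps a bad name match from relocating a system outside the area its county centroid already implies.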


10 Contributions

We developed a “Tiered Explicit, Match, and Model” (TEMM) approach to compile a nationwide water system boundary layer via an open source data pipeline. Such a dataset is unprecedented: presently, easily-accessible, machine-readable, and clean water system spatial boundary data – if available online – is siloed across states. This work brings together multiple spatial datasets where they exist, and develops algorithms and models to “fill in” missing data where it does not exist.

The spatial layer covers 44,786 community water systems and a total population of 307,108,553. The vast majority of this population (307 million, approximately 100%) is covered by a Tier 1 or Tier 2 spatial boundary, both of which have relatively high accuracy.

The data pipeline developed in this work can dynamically regenerate the TEMM spatial layer based on the addition of new Tier 1 explicit boundaries, refinement of the Tier 2 matching algorithm, improvements to the Tier 3 modeled boundaries, and changes to any other data dependencies (e.g., improved centroid location to solve the pancake problem). This pipeline is provided under the MIT License online at an open source Github repository along with output TEMM spatial layer results in geojson, shapefile, and csv format. Contributor guidelines are available on Github, and detail the processes by which contributions may be made to the project.

We imagine the TEMM spatial layer evolving into an appropriate input for any analyses that depend on nationwide water service coverage boundaries.


11 Footnotes


  1. In a previous report (eda_february.html,“Section 6 TIGER proxy polygon appropriateness”), we show that matching PWSID to TIGER places is highly accurate: matched TIGER place polygons intersected explicit water service boundary polygons 93.16% of the time. In this report, we add more states, modify the matching algorithm, and further quantify this boolean (TRUE/FALSE) spatial intersection in terms of “percent overlap”. Thus, this older metric should be disregarded in light of new developments described herein.↩︎

  2. We drop 3,464 (7.01%) of water systems from this analysis for one of three reasons: (i) they lack a latitude or longitude (and hence we cannot “center” them anywhere), (ii) they fall outside of the 50 US States (e.g., territories like Puerto Rico, Guam, and American Samoa), or (iii) their reported population or service connection count is less than 15, which is the minimum service connection count required for a system to be classified as a “community water system”.↩︎

  3. Some water systems were excluded from this data layer. Specifically, 4,526 water systems were dropped because they either had reported service connections less than 15 or population count less than 25 or were associated with US territories; most excluded water systems in this step are found in Puerto Rico. Future efforts should refine service connection and population data in these systems.↩︎

  4. Labeled radii are calculated from the convex hull area of Tier 1 systems instead of simple water system area because they better represent calculated Tier 3 circular buffers.↩︎

  5. “Overfitting” is the tendency of a statistical or machine learning model to fit well to “seen” training data but generalize poorly to “unseen” testing data. We need the model to generalize well to unseen water systems. The linear model actually results in the lowest error of the models tested, which may at first seem surprising, but is not, considering the strong linear relationship between water system radius and service connections. One might expect machine learning approaches like Random Forest and xgboost to outperform the linear model. In fact, they do on training sets, but they tend to overfit the training set and not generalize as well to test sets. Simply put, we do not have enough of the high-dimensional data that machine learning models require to outperform classical approaches like linear regression in the context of this problem.↩︎

  6. When only the dependent/response variable is log-transformed: exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. Example: the coefficient is 0.198. (exp(0.198) – 1) * 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.↩︎

  7. When both the dependent/response variable and the independent/predictor variable(s) are log-transformed: interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For an x% increase, raise (1 + x/100) to the power of the coefficient, subtract 1, and multiply by 100. Example: for every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 – 1) * 100 = 3.7 percent.↩︎

  8. There are 2 Tier 1 Native American systems, which heavily impacts modeling: meaningful models cannot be fit to data with only two observations. Thus, we re-code the 2 existing “N” systems to “M” to train the Tier 3 models. With the addition of an appreciable number (e.g., > 50) of labeled Tier 1 Native American boundaries, we can revert this re-coding step.↩︎

  9. The p-value in a linear regression tests the null hypothesis that the coefficient has no effect (i.e., coefficient = zero). A low p-value indicates evidence for rejecting the null hypothesis and a high p-value indicates otherwise.↩︎
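The coefficient-interpretation arithmetic in footnotes 6 and 7 can be checked directly in R:

```r
b <- 0.198  # example coefficient from the footnotes above

# log(y) ~ x: percent change in y per one-unit increase in x
(exp(b) - 1) * 100   # ≈ 21.9

# log(y) ~ log(x): percent change in y for a 20% increase in x
(1.20^b - 1) * 100   # ≈ 3.7
```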